Two relatively new tools for analysis of data compiled in evaluation studies
are presented. The National Test-Equating Study in Reading, known as the Anchor
Test Study, produced tables of score correspondence between the eight reading
comprehension and vocabulary tests most widely used in the United States. Two
types of tables from the study should be useful for educational evaluation
studies. The first shows equivalent raw scores for individual pupils for each
of the 28 pairs of reading comprehension and vocabulary tests; the structure of
these tables is presented. A second set of equivalency tables, nearly identical
in structure, shows equivalent raw-score means for schools on each of the 28
pairs of tests. Separate tables are provided by the study for individuals and
groups of pupils in grades 4, 5, and 6. To illustrate the use of the Anchor
Test Study results in educational evaluations, four designs are considered. A
second tool considered is sampling theory, not often used to advantage in
educational evaluation. For studies that involve populations exceeding 300
units of analysis, both matrix sampling and examinee sampling should be
considered. References are given for further discussions of these methods. (EC)

Claiming novelty in evaluation is a dangerous business. More often than not,
what seems to be an innovation today is in reality a rediscovery of some
once-often-used-but-[illegible]. Those of us in evaluation can make but a sad
case for our skill as archivists; or perhaps we should take additional course
work in the history of education.

Two examples come readily to mind: the educational assessment movement and
criterion-referenced testing. It cannot be denied that large-scale educational
assessments have enjoyed widespread growth in popularity during the past few
years. Dr. Tyler's leadership with CAPE has become a vital and promising
national assessment program. State departments of education have, in just a
few short years, moved from open resistance to large-scale comparative
assessment, to the investment of thousands, if not millions, in their own
assessment programs. Large school districts, too, are now contracting with the
major testing companies to design and conduct district-wide assessments. One
would suspect that these latter activities are unprecedented, and a natural
outgrowth of the national and state-wide assessment programs. But some day
when you find yourself in a good library with an hour to kill, pick up the 1916
Yearbook of the National Society for the Study of Education. In it you will
find the report of a district-wide assessment, conducted by Ellwood P.
Cubberley for Salt Lake City in the previous year. Are the modern assessments
different? Certainly. But hardly to a degree that would support a claim to
innovation. The case for novelty of criterion-referenced testing was dealt a
serious blow in 1972 by Peter Airasian and George Madaus, in an article* in
Measurement in Education. They cite a statement by E.L. Thorndike defining the
difference between criterion-referenced and norm-referenced testing, and then
note the use of criterion-referenced measurement in a 1916 study by the Boston
Public Schools.

Presented at the 1974 Annual Meeting of the American Educational Research
Association, Chicago, Illinois.



*The Boston study (by F. W. Ballou) is reported in the Fifteenth Yearbook of
the National Society for the Study of Education, Standards and Tests for the
Measurement of the Efficiency of Schools and School Systems.

With this background of claimed innovation rent asunder, it is with some
trepidation that I propose two relatively new tools for evaluation studies. Let
me say at the outset that neither represents new technology, just procedures
that haven't been used to a great degree in past evaluation studies. Unless I
haven't [remainder illegible]

The National Test-Equating Study in Reading, more commonly known as the Anchor
Test Study, is complete. As reported in an AERA symposium held last year
(Bianchini, 1973; Loret, 1973; Jaeger, 1973), the Study produced tables of
score-correspondence between the eight reading comprehension and vocabulary
tests most widely used in the United States. When the final report on the
Study is released by the U.S. Office of Education later this year, two types of
tables should be most useful for educational evaluation studies. First are
tables that show equivalent raw scores for individual pupils for each of the
twenty-eight pairs of reading comprehension and vocabulary tests. The structure
of these tables is shown in Table 1. A second set of equivalency tables, nearly
identical in structure, shows equivalent raw-score means for schools on each of
the twenty-eight pairs of tests. Separate tables are provided by the Study for
individuals and groups of pupils in grades four, five and six.

To illustrate the use of Anchor Test Study results in educational evaluation,
consider four designs proposed by Andrew Porter in an AERA paper delivered last
year (Porter, 1973). Consistent with Porter's paper, I will label these designs
Case I and Case II, each occurring in Situation A and Situation B. Case I
represents the rarely-occurring evaluator's dream where units of analysis are
randomly assigned to an experimental group and a control group. An observation
on a variable is made at the outset of the experiment, the treatment to be
evaluated is applied to the experimental group, and a final observation is then
made on both the experimental group and the control group. This is nothing more
than the randomized pre-post two-group design suggested by Campbell and Stanley
(1963). Anchor Test Study results can be used with this design when the
observations made at the outset of the experiment, at the end of the
experiment, or at both times, are measures of reading comprehension or reading
vocabulary. Porter's Case II represents the situation wherein units of analysis
are assigned to an experimental group and a control group in some purposeful
(nonrandom) way. Again, pre-treatment and post-treatment observations are made
on both groups, giving a pre-post two-group design without random assignment.

Situation A occurs when the tests used for pre-treatment measurement and
post-treatment measurement are parallel, and Situation B occurs when different,
or non-parallel, measurements are made pre-treatment and post-treatment. These
designs are shown in the Campbell and Stanley notation in Table 2.

Porter examined analysis strategies for each of these designs: analysis of
covariance with a random covariate, analysis of variance with an index of
response as the dependent variable, repeated measures analysis of variance,
and analysis of covariance with estimated true scores.

For Case I - Situation A designs (random assignment and use of parallel pre-
and post-tests), analysis of variance on the dependent variable: post-test
score minus within-group reliability times pre-test score (an index of
response) was found to be the most efficient analysis procedure. To be
efficient, this procedure requires that the reliability of the pre and post
measures be known. When reliability is not known, analysis of covariance using
pre-test scores as a covariate is recommended by Porter.
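
The index-of-response analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Porter's published procedure: the function names, the reliability of 0.90, and all pupil scores are hypothetical, and the F statistic is the ordinary one-way analysis of variance applied to the index.

```python
from statistics import mean

def index_of_response(pre, post, reliability):
    """Post-test score minus reliability times pre-test score, per pupil."""
    return [y - reliability * x for x, y in zip(pre, post)]

def one_way_f(groups):
    """Ordinary one-way ANOVA F statistic for a list of score lists."""
    all_scores = [s for g in groups for s in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical raw scores; 0.90 is an assumed within-group reliability.
RELIABILITY = 0.90
exp_pre, exp_post = [20, 25, 30, 22], [28, 31, 37, 30]
ctl_pre, ctl_post = [21, 24, 29, 23], [23, 27, 30, 26]

exp_index = index_of_response(exp_pre, exp_post, RELIABILITY)
ctl_index = index_of_response(ctl_pre, ctl_post, RELIABILITY)
f_stat = one_way_f([exp_index, ctl_index])
print(f"F(1, {len(exp_index) + len(ctl_index) - 2}) = {f_stat:.2f}")
```

The reliability enters only as a fixed multiplier, which is why the procedure depends on that value being known in advance rather than estimated from the evaluation data.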

When pupils are the units of analysis, the Anchor Test Study tables of
score-correspondence can facilitate use of the analysis strategies recommended
by Porter. Pre-test measures and post-test measures can consist of scores on
any of the eight reading comprehension tests or any of the eight vocabulary
tests equated in the Anchor Test Study. Prior to analysis, the scores of
individual pupils would be converted to scores on a single test, using the
equating tables.

The procedure for conversion of scores is quite simple. First, the test to be
used in the final analysis must be selected. A logical choice is the test for
which the largest number of data are available. Conversion of scores from one
test to another adds error of equating to the omnipresent error of
measurement. Although the standard error of equating is typically one-fourth
to [illegible] a raw-score point (compared to standard errors of measurement
typically in the range two to four raw-score points), the two types of errors
are cumulative, which is why the test with the most available data is the
logical choice. Once the test to be used in the analysis has been selected, the
equating tables are used to convert scores on all other tests to scores on the
analysis test. If test results for pupils are available in raw-score form, the
Anchor Test Study tables can be used directly. If results are available only in
standard-score form (such as percentile ranks or grade equivalent units),
publishers' norms must be used to convert back to equivalent raw scores, prior
to using the Anchor Test Study equating tables.
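
The conversion step amounts to a table lookup. The sketch below assumes a hypothetical equating table from "Test B" to the analysis test "Test A"; the table values, test names, and function names are invented for illustration and are not Anchor Test Study data. The combined-error function further assumes that measurement error and equating error are independent, so that their variances add.

```python
import math

# Hypothetical fragment of an equating table: raw score on "Test B"
# mapped to the equivalent raw score on the analysis test, "Test A".
# These values are invented; they are NOT Anchor Test Study data.
B_TO_A = {10: 12, 11: 13, 12: 15, 13: 16, 14: 18}

def to_analysis_test(test_name, raw_score, tables, analysis_test="A"):
    """Return the raw score expressed on the analysis test's scale."""
    if test_name == analysis_test:
        return raw_score  # already on the analysis scale; no equating error
    return tables[(test_name, analysis_test)][raw_score]

def combined_standard_error(sem, see):
    """Total standard error, assuming independent measurement and
    equating errors (an assumption made here, so the variances add)."""
    return math.sqrt(sem ** 2 + see ** 2)

tables = {("B", "A"): B_TO_A}
pupils = [("A", 14), ("B", 12), ("B", 10)]  # (test taken, raw score)
converted = [to_analysis_test(t, s, tables) for t, s in pupils]
print(converted)
print(round(combined_standard_error(sem=3.0, see=0.25), 3))
```

With a standard error of measurement of 3 points and a standard error of equating of one-fourth of a point, the combined value under the independence assumption is about 3.01 points, so equating adds little to the error already present in the scores.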

The Anchor Test Study provides precise estimates of parallel-forms test-retest
reliability for the eight reading comprehension and vocabulary subtests
equated. If the experimental and control groups used in the evaluation can be
assumed to be equal in heterogeneity on these measures to the nationwide
populations of fourth, fifth, or sixth-graders, the Anchor Test Study
reliability estimates can be used with the analysis of variance on index of
response. If scores on different tests are converted to scores on a single
test, the reliability of the pooled measures might be estimated from a weighted
average of the reliabilities reported in the Anchor Test Study. The weights to
be applied would equal the proportions of scores available on each test
equated. Although this procedure would generate some error, the test-retest
reliabilities differ by no more than 0.08 across tests for the comprehension
subtests used with pupils in [illegible]; the error should therefore be
minimal.
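
The weighted-average estimate is simple arithmetic, shown here as a short sketch. The test names, reliabilities, and score counts are hypothetical; only the weighting rule (weights equal to each test's share of the pooled scores) comes from the text.

```python
def pooled_reliability(reliabilities, score_counts):
    """Weighted average of per-test reliabilities, with weights equal to
    each test's share of the scores in the converted-score pool."""
    total = sum(score_counts.values())
    return sum(reliabilities[t] * score_counts[t] / total for t in score_counts)

# Hypothetical values: test names, reliabilities, and counts are invented.
reliabilities = {"Test A": 0.90, "Test B": 0.86, "Test C": 0.82}
score_counts = {"Test A": 200, "Test B": 100, "Test C": 100}
pooled = pooled_reliability(reliabilities, score_counts)
print(round(pooled, 3))
```

Because the individual reliabilities span only 0.08 in this example, the pooled value cannot be far from any one of them, which is the sense in which the error of this shortcut stays small.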



When schools are used as units of analysis in a Case I - Situation A
evaluation design, the Anchor Test Study equating tables for school means might
prove useful. Procedures for conversion of scores would be identical to those
used with the tables for individuals. It is a seeming paradox of classical test
theory that the reliabilities of group means are no larger than the
reliabilities of individual scores, when individuals are randomly assigned to
groups. The standard error of measurement is smaller for group scores, but that
is of no consequence for the analysis of variance on an index of response.
Given the assumption of random assignment then, the Anchor Test Study
reliability estimates might still be useful when schools are the units of
analysis.
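
The seeming paradox can be checked with a short derivation. The notation is assumed here and does not appear in the paper: each observed score is X = T + E, with true-score variance sigma_T^2, error variance sigma_E^2, errors uncorrelated with true scores, and groups of n pupils formed by random assignment from a large population.

```latex
\begin{align*}
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}, \\
\operatorname{Var}(\bar{T}) &= \frac{\sigma_T^2}{n}, \qquad
\operatorname{Var}(\bar{E}) = \frac{\sigma_E^2}{n}, \\
\rho_{\bar{X}\bar{X}'} &= \frac{\sigma_T^2 / n}{\sigma_T^2 / n + \sigma_E^2 / n}
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} = \rho_{XX'} .
\end{align*}
```

Under random assignment the true-score spread among group means shrinks at the same rate as the error variance of a mean, so the ratio that defines reliability is unchanged even though the standard error of measurement of a mean, sigma_E divided by the square root of n, is smaller.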

For Case I - Situation B designs, Porter recommends use of analysis of
covariance with pre-test scores as the covariate. This is somewhat less
efficient than analysis of variance on an index of response, particularly when
sample sizes are small. However, when pre-test and post-test measures are
different, the regression coefficient of post-test on pre-test will probably be
unknown, making the ANOVA procedure less efficient. Using the Anchor Test Study
equating tables, some formerly Situation B designs might be converted to
Situation A designs. Eight different tests could be used as pre-treatment or
post-treatment measures, and a Situation A design would obtain, provided the
proportion of scores obtained on each test was approximately the same for the
pre-treatment and post-treatment measurements. Thus the Anchor Test Study
tables would permit use of a more efficient analysis procedure.

For both Case II - Situation A and Case II - Situation B designs, Porter
argues that analysis of covariance with estimated true scores on the
pre-treatment measure as the covariate is often the method of choice. An
alternative contender for Situation A designs is analysis of variance of gain
scores. This procedure assumes that the pre-test and post-test measure the same
variable with equal reliability. As in Case I, the Anchor Test Study results
might permit use of this simpler analysis procedure where it would otherwise be
infeasible. If reading comprehension were measured using any of the eight
Anchor Test Study subtests, the assumptions necessary to use gain scores would
probably be met, provided that approximately equal proportions of pupils
received the same subtest as a pre-test and a post-test. Pupils' scores on the
eight comprehension subtests would be converted to scores on a single subtest
using the Anchor Test Study equating tables, just as described for Case I
designs. If it is necessary to use analysis of covariance with estimated true
pre-test score as the covariate, the reliability estimates provided by the
Anchor Test Study might again be useful. The necessary assumption is that the
test scores of pupils assigned to the experimental and control groups have
variance equal to that of the populations of fourth, fifth or sixth-graders in
the nation. If a variety of Anchor Test Study subtests are used in the
evaluation, the reliability of scores in the converted-score pool could be
estimated using the weighted average procedure described above. The
parallel-forms test-retest reliabilities provided by the Anchor Test Study are
the type needed to properly estimate true scores on the covariate.
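
To show how a reliability estimate feeds into an estimated true score, the sketch below uses the classical regression (Kelley) estimate: group mean plus reliability times the deviation from the group mean. Whether Porter's procedure uses exactly this form is an assumption here; the function name, the scores, and the reliability of 0.88 are hypothetical.

```python
from statistics import mean

def estimated_true_scores(scores, reliability):
    """Kelley's regression estimate of true scores:
    group mean + reliability * (observed score - group mean)."""
    m = mean(scores)
    return [m + reliability * (x - m) for x in scores]

# Hypothetical pre-test scores for one group; 0.88 is an assumed reliability.
pre_test = [18, 22, 25, 31, 34]
true_hat = estimated_true_scores(pre_test, 0.88)
print([round(t, 2) for t in true_hat])
```

Each estimate is pulled toward the group mean by an amount that grows as reliability falls, which is why an accurate reliability figure matters for the covariate.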



A second tool that deserves more use than it gets in evaluation studies is
sampling theory. Although sampling theory has been widely applied in
sociological studies, agricultural research and population surveys, it is not
often used to advantage in educational evaluation. I'm not claiming that we
always collect measurements on entire populations, but that we either use
[illegible] or simple random sampling. In 1963, Cronbach proposed that
evaluations could be conducted more efficiently if samples of pupils completed
samples of test items. This approach, matrix sampling, is being used in
increasing numbers of statewide assessments. Alex Law will describe one such
application later in the program. Neither matrix sampling nor examinee sampling
designs have been used to great advantage in local assessments or evaluation
and their data-collection efficiency has suffered for it.
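
A matrix sampling plan, in which samples of pupils complete samples of test items, can be sketched as follows. This is one simple variant chosen for illustration (disjoint item subsets, each given to its own random subsample of pupils), not a design taken from the cited papers; the function name, the 40-item test, and the form sizes are hypothetical.

```python
import random

def matrix_sample(pupils, items, n_forms, pupils_per_form, seed=0):
    """Split the items into n_forms disjoint subtests and give each
    subtest to its own random subsample of pupils."""
    rng = random.Random(seed)
    shuffled_items = items[:]
    rng.shuffle(shuffled_items)
    forms = [shuffled_items[i::n_forms] for i in range(n_forms)]
    chosen = rng.sample(pupils, n_forms * pupils_per_form)
    plan = {}
    for f, form_items in enumerate(forms):
        group = chosen[f * pupils_per_form:(f + 1) * pupils_per_form]
        for pupil in group:
            plan[pupil] = sorted(form_items)
    return plan

# Hypothetical sizes: 1180 pupils, a 40-item test, 4 forms of 10 items,
# 30 pupils per form.
pupils = list(range(1, 1181))
items = list(range(1, 41))
plan = matrix_sample(pupils, items, n_forms=4, pupils_per_form=30)
print(len(plan), "pupils tested;", len(next(iter(plan.values()))), "items each")
```

In this sketch only 120 of 1180 pupils are tested and each answers a quarter of the items, yet every item is still administered to some pupils, which is the source of the data-collection savings.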

For evaluations that involve populations exceeding three hundred units of
analysis (either pupils, classrooms or schools), both matrix sampling and
respondent sampling should be considered. Neither topic lends itself well to a
two-minute discussion, so I'll merely call your attention to work on matrix
sampling by Shoemaker (1970, 1971) and Bunda (1973) reported in the Journal of
Educational Measurement, and to my own work on examinee sampling, reported at
this meeting last year (Jaeger, 1973). To provide a glance at the latter work,
Table 3 shows the numbers of sixth-grade pupils that would have to be sampled
in order to estimate the mean reading achievement in a school district with
1180 sixth-graders, within 0.2 grade equivalent units, with 95 percent
confidence. Required sample sizes are shown for seventeen different sampling
and estimation procedures. It is clear that simple random sampling is far from
being the most efficient procedure. If you'd like more details, I'd be happy
to send you the entire paper.
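
For the simple random sampling row of Table 3, the required sample size can be approximated with the textbook formula that combines the normal approximation with the finite population correction. The population of 1180 and the tolerance of 0.2 grade equivalent units come from Table 3; the standard deviation of 1.1 units is an assumed value chosen for illustration, and the paper does not state how its figures were computed.

```python
import math

def srs_sample_size(population, sd, half_width, z=1.96):
    """Simple-random-sample size needed to estimate a mean to within
    +/- half_width, using the normal approximation and the finite
    population correction."""
    n0 = (z * sd / half_width) ** 2  # size for an infinite population
    return math.ceil(n0 / (1 + n0 / population))

# 1180 pupils and +/- 0.2 grade-equivalent units come from Table 3;
# the standard deviation of 1.1 units is an assumed value.
n_srs = srs_sample_size(population=1180, sd=1.1, half_width=0.2)
print(n_srs)
```

With the assumed standard deviation the function returns 106, the simple random sampling figure in Table 3. That agreement reflects the choice of standard deviation and should not be read as confirmation of the original computation.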

Having begun with the contention that not much is really new in the
technology of evaluation, I've tried to show how two relatively old tools
could be used in new ways. Although the technology of test equating is not
new, its application on a national scale in the Anchor Test Study has produced
a significant tool for educational evaluators. Sampling theory too, has been
available to us for decades. Perhaps we can increase the efficiency of our
evaluations by employing it wisely.

Or perhaps there is a new technology of evaluation. I share your anticipation
of the coming hour.

1. Airasian, Peter W. and George F. Madaus, "Criterion-referenced testing in
   the classroom," Measurement in Education, May, 1972.

2. Bianchini, John C., "The national test-equating study in reading: results
   of the study," presented at the 1973 Annual Meeting of the American
   Educational Research Association, New Orleans, La.

3. Bunda, Mary Anne, "An investigation of an extension of item sampling which
   yields individual scores," Journal of Educational Measurement, 10, (1973),
   pp. 117-130.

4. Campbell, Donald T. and Julian C. Stanley, "Experimental and
   quasi-experimental designs for research on teaching," in Handbook of
   Research on Teaching, ed. N.L. Gage. Chicago: Rand McNally, 1963.

5. Cubberley, Ellwood P., "Use of standard tests at Salt Lake City, Utah,"
   15th Yearbook of the National Society for the Study of Education, Chicago:
   Univ. of Chicago Press, 1916, pp. 107-110.

6. Jaeger, Richard M., "An evaluation of sampling designs for school testing
   programs," presented at the Annual Meeting of the National Council on
   Measurement in Education, 1973, New Orleans, La.

7. Jaeger, Richard M., "The national test-equating study in reading: origins
   of the study and its historical antecedents," presented at the 1973 Annual
   Meeting of the American Educational Research Association, New Orleans, La.

8. Loret, Peter, "The national test-equating study in reading: administration
   of the study," presented at the 1973 Annual Meeting of the American
   Educational Research Association, New Orleans, La.

9. Porter, Andrew C., "Analysis strategies for some common evaluation
   paradigms," presented at the 1973 Annual Meeting of the American
   Educational Research Association, New Orleans, La.

10. Shoemaker, David M., "Allocation of items and examinees in estimating a
    norm distribution by item-sampling," Journal of Educational Measurement,
    7, (1970), pp. 123-128.

11. Shoemaker, David M., "Further results on the standard errors of estimate
    associated with item-examinee sampling procedures," Journal of Educational
    Measurement, 8, (1971), pp. 215-220.



Table 3: Sizes of Samples Required to Estimate Mean Reading Achievement
         within ± 0.2 Grade Equivalent Units with 95 Percent Confidence*

Sampling and Estimation Procedure                        Required Sample Size

Simple Random Sampling (SRS)                                   106 pupils

Stratified Sampling by Lorge-Thorndike Ability Test
Scores, Six Strata:
    Proportional Allocation (Strat-[illegible])                 26 pupils
    Optimal Allocation (Strat-opt)                              25 pupils

Linear Systematic Sampling:
    Alphabetic Order (LSS-alpha)                                59 pupils**
    Increasing Order of Lorge-Thorndike Scores (LSS-inc)        59 pupils**
    Increasing Order of Lorge-Thorndike Scores,
      End Corrections Used                                      59 pupils**
    Order Reversed in Alternate Strata (LSS-O.R)                59 pupils**

Centrally Located Systematic Samples (CSS)                      59 pupils**

Balanced Systematic Sampling (BSS)                             118 pupils

Single Stage Cluster Sampling:
    Unbiased Estimation, Schools Used as Clusters
      (RSC-schools-unb)                                       1041 pupils
    Ratio Estimation, Schools Used as Clusters
      (RSC-schools-rat)                                        394 pupils
    Probabilities Proportional to School Enrollments,
      Schools Used as Clusters (PPS-schools)                   577 pupils
    Probabilities Proportional to Fifth-Grade SCAT Score
      Totals, Schools Used as Clusters (PPES-schools)          236 pupils
    Unbiased Estimation, Classrooms Used as Clusters
      (RSC-class-unb)                                          865 pupils
    Ratio Estimation, Classrooms Used as Clusters
      (RSC-class-rat)                                          262 pupils
    Probabilities Proportional to Classroom Enrollments,
      Classrooms Used as Clusters (PPS-class)                  314 pupils
    Probabilities Proportional to Lorge-Thorndike Score
      Totals, Classrooms Used as Clusters (PPES-class)          53 pupils

*Midcity data, population size = 1180 sixth-grade pupils.
**Five percent is the smallest sampling fraction investigated. Smaller sampling
fractions might provide acceptable precision for these sampling methods.



